Skip to content

BUG-22796 Concat multicolumn tz-aware DataFrame #23036

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Oct 9, 2018

Conversation

tonytao2012
Copy link
Contributor

@tonytao2012 tonytao2012 commented Oct 8, 2018

Numpy arrays don't have the datetimetz dtype, so I just passed through the DatetimeIndex directly.

Side note: There's another small bug (I think) where np.nan or pd.NaT takes on the dtype of the column instead of the row when concatenating, but the column should instead have an object dtype.

@pep8speaks
Copy link

pep8speaks commented Oct 8, 2018

Hello @tonytao2012! Thanks for updating the PR.

Comment last updated on October 09, 2018 at 12:08 Hours UTC

@@ -53,6 +53,21 @@ def test_concat_multiple_tzs(self):
expected = DataFrame(dict(time=[ts2, ts3]))
assert_frame_equal(results, expected)

def test_concat_tz_NaT(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a test where ts1 is NaT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it results in an object column instead of the correct dtype. I'm not entirely sure how to solve this issue, as it appears to happen when the dataframe is created before the concat even occurs.

ts1 = pd.Timestamp(pd.NaT, tz='UTC')
df1 = pd.DataFrame([[ts1, ts2]])

df1[0]
Out[37]: 
0   NaT
Name: 0, dtype: datetime64[ns]

Any ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That behavior is theoretically correct if NaT was replaced by a naive Timestamp. NaT is an edge case where it can exist with a datetime64[ns, tz] dtype.

Don't worry about solving that in this PR. Feel free to open up another issue about that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, should I still add the new test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add the test with ts1 as NaT and xfail it, but link it to a new issue.

@mroeschke
Copy link
Member

The bug you're describing may be #12499

@mroeschke mroeschke added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype labels Oct 8, 2018
@codecov
Copy link

codecov bot commented Oct 8, 2018

Codecov Report

Merging #23036 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23036      +/-   ##
==========================================
- Coverage   92.19%   92.19%   -0.01%     
==========================================
  Files         169      169              
  Lines       50904    50911       +7     
==========================================
+ Hits        46933    46939       +6     
- Misses       3971     3972       +1
Flag Coverage Δ
#multiple 90.61% <100%> (-0.01%) ⬇️
#single 42.3% <14.28%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/concat.py 98% <100%> (-0.38%) ⬇️
pandas/core/dtypes/dtypes.py 95.58% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca7d518...de4c427. Read the comment docs.

@tonytao2012
Copy link
Contributor Author

Added the xfail test and linked to issue #23037. Thanks for the feedback.

@@ -53,6 +55,24 @@ def test_concat_multiple_tzs(self):
expected = DataFrame(dict(time=[ts2, ts3]))
assert_frame_equal(results, expected)

@pytest.mark.parametrize('t1', ['2015-01-01',
pytest.param('pd.NaT', marks=pytest.mark.xfail(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is pd.NaT in quotes? Shouldn't it just be pd.NaT?

@@ -883,6 +883,7 @@ Reshaping
- Bug in :func:`pandas.wide_to_long` when a string is passed to the stubnames argument and a column name is a substring of that stubname (:issue:`22468`)
- Bug in :func:`merge` when merging ``datetime64[ns, tz]`` data that contained a DST transition (:issue:`18885`)
- Bug in :func:`merge_asof` when merging on float values within defined tolerance (:issue:`22981`)
- Bug in :func:`pandas.concat` when merging multicolumn DataFrames with tz-aware data (:issue`22796`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific here? The isn't an issue with all multi-column DataFrames with tz-aware data. What are the exact conditions needed?

@@ -186,6 +188,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):

if getattr(self.block, 'is_datetimetz', False) or \
is_datetimetz(empty_dtype):
if self.block is None:
missing_arr = np.full(self.shape, fill_value)
missing_time = DatetimeIndex(missing_arr[0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u walk back up the stack and see exactly where block is None- this is a guarantee iirc of this function to not be None so this might be an error higher up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block is none when there are a mismatched number of columns between the two concatenating dataframes. I assumed this was correct behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I last changed this, I was really trying NOT to use DTI directly here as we are trying isolate things like that to higher level methods.

See if you can revise, otherwise I will take a look.

Copy link
Contributor Author

@tonytao2012 tonytao2012 Oct 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, in that case, I'm not sure what to do. Clearly, block is supposed to be None in cases where the columns are mismatched:

    for blkno, placements in libinternals.get_blkno_placements(blknos,
                                                               mgr.nblocks,
                                                               group=False):

        assert placements.is_slice_like

        join_unit_indexers = indexers.copy()

        shape = list(mgr_shape)
        shape[0] = len(placements)
        shape = tuple(shape)

        if blkno == -1:
            unit = JoinUnit(None, shape)

When block is None, we have to create and return some array in get_reindexed_values, but np arrays can't have a tz dtype. I apologize if I'm missing something obvious. Feel free to take over the issue if you'd like, as I'm unsure of how to continue from here. I'll also keep thinking about it.

@@ -186,6 +188,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):

if getattr(self.block, 'is_datetimetz', False) or \
is_datetimetz(empty_dtype):
if self.block is None:
missing_arr = np.full(self.shape, fill_value)
missing_time = DatetimeIndex(missing_arr[0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I last changed this, I was really trying NOT to use DTI directly here as we are trying isolate things like that to higher level methods.

See if you can revise, otherwise I will take a look.

@jreback jreback added this to the 0.24.0 milestone Oct 9, 2018
@TomAugspurger
Copy link
Contributor

Just to make sure, on this PR I get

In [1]: import pandas as pd

In [2]: a = pd.DataFrame({"A": pd.to_datetime([1, 2]).tz_localize("UTC")})

In [3]: b = pd.DataFrame({"A": pd.to_datetime([1, 2]).tz_localize("UTC"), "B": pd.to_datetime([1, 2]).tz_localize("UTC")})

In [4]: pd.concat([a, b], sort=True).B.values
Out[4]:
array([                          'NaT', '1970-01-01T00:00:00.000000001',
       '1970-01-01T00:00:00.000000002'], dtype='datetime64[ns]')

In [5]: pd.concat([a, b], sort=True)
Out[5]:
                                    A                                   B
0 1970-01-01 00:00:00.000000001+00:00                                 NaT
1 1970-01-01 00:00:00.000000002+00:00 1970-01-01 00:00:00.000000001+00:00
0 1970-01-01 00:00:00.000000001+00:00 1970-01-01 00:00:00.000000002+00:00
1 1970-01-01 00:00:00.000000002+00:00

the xfail you added @jreback is for the fact that B in Out[5] is incorrect?

@jreback
Copy link
Contributor

jreback commented Oct 9, 2018

hmm, that does look suspicious but that's not the reason for the xfail.

@jreback
Copy link
Contributor

jreback commented Oct 9, 2018

pushed an updated, that was a just-introduced bug.

@jreback
Copy link
Contributor

jreback commented Oct 9, 2018

one more time to fix the lint issue.

@jreback jreback merged commit 574eb75 into pandas-dev:master Oct 9, 2018
@jreback
Copy link
Contributor

jreback commented Oct 9, 2018

thanks @tonytao2012

welcome to have you work on the issue that we are xfailing

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concat of dataframe with tz-aware datetime column against dataframe without, fails
5 participants